ALLOC: Fail Stale Primary Alloc. Req. without Data #37226

original-brownbear · 2019-01-08T14:47:55Z

Get the indices shard store status before enqueuing the reallocation state update task. On a best-effort basis, this prevents tasks that would fail because the target node does not hold a stale copy of the shard.

Closes #37098 (Allocate_stale_primary appears to succeed on wrong node)
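To illustrate the idea in the description, here is a minimal, self-contained sketch of a pre-check that rejects a stale-primary allocation when the target node reported no copy of the shard. It is not the code from this PR: `StalePrimaryPreCheck`, `StaleCommand`, and the nested-map shape of `storeNodes` are hypothetical stand-ins for the real command and shard store response types. Only the two error messages are taken from the diff hunks in this thread.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StalePrimaryPreCheck {

    /** Hypothetical stand-in for an allocate_stale_primary command. */
    static final class StaleCommand {
        final String index;
        final int shardId;
        final String node;

        StaleCommand(String index, int shardId, String node) {
            this.index = index;
            this.shardId = shardId;
            this.node = node;
        }
    }

    /**
     * Throws if any command targets a node that reported no copy of the shard.
     * storeNodes maps index name -> shard id -> names of nodes holding a copy
     * (an assumed shape, not the real response type).
     */
    static void validate(List<StaleCommand> commands, Map<String, Map<Integer, Set<String>>> storeNodes) {
        for (StaleCommand c : commands) {
            // An unknown index is treated like a shard without data in this sketch.
            Set<String> nodes = storeNodes
                .getOrDefault(c.index, Collections.<Integer, Set<String>>emptyMap())
                .get(c.shardId);
            if (nodes == null) {
                throw new IllegalArgumentException(
                    "No data for shard [" + c.shardId + "] of index [" + c.index + "] found on any node");
            }
            if (!nodes.contains(c.node)) {
                throw new IllegalArgumentException(
                    "No data for shard [" + c.shardId + "] of index [" + c.index + "] found on node [" + c.node + ']');
            }
        }
    }
}
```

Because the store status is read before the state update task is enqueued, the check can only be best effort: the copy may disappear between the check and the task running.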


elasticmachine · 2019-01-08T15:11:37Z

Pinging @elastic/es-distributed

original-brownbear · 2019-01-08T15:13:06Z

server/src/test/java/org/elasticsearch/cluster/routing/AllocationIdIT.java

@@ -135,7 +134,7 @@ public void testFailedRecoveryOnAllocateStalePrimaryRequiresAnotherAllocateStale
            assertThat(shardRouting.unassignedInfo().getReason(), equalTo(UnassignedInfo.Reason.ALLOCATION_FAILED));
        });

-        try(Store store = new Store(shardId, indexSettings, new SimpleFSDirectory(indexPath), new DummyShardLock(shardId))) {
+        try (Store store = new Store(shardId, indexSettings, new SimpleFSDirectory(indexPath), new DummyShardLock(shardId))) {


Sorry for these two noisy cleanups that snuck into this file

I think we should back them out for the sake of future git annotate users :)

DaveCTurner

Looks good. I left a small number of small comments.

I would rather avoid the // code comments that you added - they don't really say anything that isn't already clear to me from the code, and I always worry about comments like this that fall out of sync later.

DaveCTurner · 2019-01-08T18:05:35Z

.../main/java/org/elasticsearch/action/admin/cluster/reroute/TransportClusterRerouteAction.java

+        for (AllocationCommand command : request.getCommands().commands()) {
+            if (command instanceof AllocateStalePrimaryAllocationCommand) {
+                if (stalePrimaryAllocations == null) {
+                    stalePrimaryAllocations = new HashMap<>();


I think we can afford to make this HashMap eagerly and avoid the noise in the loop.
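A rough sketch of what the eager variant could look like. It assumes the map groups commands by index name, which the `entry.getKey()` usage in a later hunk suggests but this thread does not show in full. `EagerGrouping` and the `Map.Entry<String, Integer>` command stand-in (index name, shard id) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EagerGrouping {

    /**
     * Groups shard ids by index name. The map is created up front, so the loop
     * body needs no null check; an empty map simply means "nothing to validate".
     */
    static Map<String, List<Integer>> groupByIndex(List<Map.Entry<String, Integer>> staleCommands) {
        final Map<String, List<Integer>> byIndex = new HashMap<>();
        for (Map.Entry<String, Integer> command : staleCommands) {
            // computeIfAbsent creates the per-index list on first use.
            byIndex.computeIfAbsent(command.getKey(), k -> new ArrayList<>()).add(command.getValue());
        }
        return byIndex;
    }
}
```

The cost is one small allocation per request even when no stale-primary command is present, in exchange for a simpler loop.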

DaveCTurner · 2019-01-08T18:15:30Z

server/src/test/java/org/elasticsearch/cluster/routing/PrimaryAllocationIT.java

+            IllegalArgumentException.class,
+            () -> client().admin().cluster().prepareReroute().add(new AllocateStalePrimaryAllocationCommand("test", 0,
+            dataNodeWithNoShardCopy, true)).get());
+        assertThat(iae.getMessage(), equalTo("No data for shard [0] of index [test] found on node [" + dataNodeWithNoShardCopy + ']'));

        logger.info("--> wait until shard is failed and becomes unassigned again");
        assertBusy(() ->


I think we no longer want this to be an assertBusy() - we expect that the previous reroute didn't do anything, so it should still be in this state from beforehand.

DaveCTurner · 2019-01-08T18:16:35Z

server/src/test/java/org/elasticsearch/cluster/routing/AllocationIdIT.java

@@ -135,7 +134,7 @@ public void testFailedRecoveryOnAllocateStalePrimaryRequiresAnotherAllocateStale
            assertThat(shardRouting.unassignedInfo().getReason(), equalTo(UnassignedInfo.Reason.ALLOCATION_FAILED));
        });

-        try(Store store = new Store(shardId, indexSettings, new SimpleFSDirectory(indexPath), new DummyShardLock(shardId))) {
+        try (Store store = new Store(shardId, indexSettings, new SimpleFSDirectory(indexPath), new DummyShardLock(shardId))) {


I think we should back them out for the sake of future git annotate users :)


original-brownbear · 2019-01-08T18:47:46Z

@DaveCTurner all points addressed, thanks for taking a look!


original-brownbear · 2019-01-08T20:57:31Z

Jenkins run gradle build tests 1

original-brownbear · 2019-01-08T21:03:54Z

Jenkins run Gradle build tests 2


DaveCTurner

Implementation looks good, I asked for a couple more tests.

DaveCTurner · 2019-01-09T08:47:22Z

.../main/java/org/elasticsearch/action/admin/cluster/reroute/TransportClusterRerouteAction.java

+                            final String index = entry.getKey();
+                            final ImmutableOpenIntMap<List<IndicesShardStoresResponse.StoreStatus>> indexStatus = status.get(index);
+                            if (indexStatus == null) {
+                                e = ExceptionsHelper.useOrSuppress(e, new IndexNotFoundException(index));


Could we have a test that hits this branch?

Actually no ... my bad this one is dead code. The logic in the index shard store status request already checks the index exists.
I added an assertion for this now and a test that makes sure it's not tripped when the index doesn't exist :)

DaveCTurner · 2019-01-09T08:47:37Z

.../main/java/org/elasticsearch/action/admin/cluster/reroute/TransportClusterRerouteAction.java

+                                        indexStatus.get(command.shardId());
+                                    if (shardStatus == null) {
+                                        e = ExceptionsHelper.useOrSuppress(e, new IllegalArgumentException(
+                                            "No data for shard [" + command.shardId() + "] of index [" + index + "] found on any node")


Could we have a test that hits this branch?

Done in ea98277 :)
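The hunks above collect validation failures with `ExceptionsHelper.useOrSuppress`. The standalone sketch below shows that accumulation pattern with a local helper of the same name. It is a stand-in written for illustration, on the assumption that the helper keeps the first exception and attaches later ones as suppressed. `CollectFailures`, `check`, and the `shardsWithData` parameter are hypothetical.

```java
import java.util.List;
import java.util.Set;

public class CollectFailures {

    /** Keeps the first exception as the primary one and attaches any later one as suppressed. */
    static <T extends Throwable> T useOrSuppress(T first, T second) {
        if (first == null) {
            return second;
        }
        first.addSuppressed(second);
        return first;
    }

    /** Checks every shard id and returns all failures folded into one exception, or null if none failed. */
    static IllegalArgumentException check(String index, List<Integer> shardIds, Set<Integer> shardsWithData) {
        IllegalArgumentException e = null;
        for (int shardId : shardIds) {
            if (!shardsWithData.contains(shardId)) {
                e = useOrSuppress(e, new IllegalArgumentException(
                    "No data for shard [" + shardId + "] of index [" + index + "] found on any node"));
            }
        }
        return e;
    }
}
```

Folding failures this way lets a caller report every invalid command in one response instead of stopping at the first.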


original-brownbear · 2019-01-09T11:11:52Z

@DaveCTurner thanks for taking a look => tests added and implementation simplified a little as well in ea98277 :)

original-brownbear · 2019-01-10T14:20:14Z

@DaveCTurner ping :)

DaveCTurner

LGTM

original-brownbear · 2019-01-10T15:28:25Z

@DaveCTurner thanks!


ALLOC: Fail Stale Primary Alloc. Req. without Data

56d2de3

* Get indices shard store status before enqueuing the reallocation state update task to prevent tasks that would fail because a node does not hold a stale copy of the shard on a best effort basis * Closes elastic#37098

original-brownbear added the >bug, :Distributed/Allocation, v7.0.0, and v6.7.0 labels Jan 8, 2019

original-brownbear added 3 commits January 8, 2019 15:50

fix formatting

740dd71

nicer format

be431a8

nicer format

e4fe2ad

original-brownbear commented Jan 8, 2019


original-brownbear requested a review from DaveCTurner January 8, 2019 15:13

DaveCTurner reviewed Jan 8, 2019


original-brownbear added 2 commits January 8, 2019 19:40

Merge remote-tracking branch 'elastic/master' into stale-allocation-problem

c7eb586

cr comments

75afbf9

original-brownbear requested a review from DaveCTurner January 8, 2019 18:47

original-brownbear added 2 commits January 8, 2019 21:22

Merge remote-tracking branch 'elastic/master' into stale-allocation-problem

339df06

fix incorrect condition

29d594a

Merge remote-tracking branch 'elastic/master' into stale-allocation-problem

9fe0321

DaveCTurner reviewed Jan 9, 2019


original-brownbear added 2 commits January 9, 2019 10:18

Merge remote-tracking branch 'elastic/master' into stale-allocation-problem

bcc8cc8

CR: more tests + remove dead code

ea98277

original-brownbear requested a review from DaveCTurner January 9, 2019 11:12

DaveCTurner approved these changes Jan 10, 2019


original-brownbear merged commit 46237fa into elastic:master Jan 10, 2019

original-brownbear deleted the stale-allocation-problem branch January 10, 2019 15:28

original-brownbear added the backport pending label Jan 10, 2019

alpar-t mentioned this pull request Jan 11, 2019

CI: test failure PrimaryAllocationIT.testForceStaleReplicaToBePromotedToPrimaryOnWrongNode #37345

Closed

original-brownbear removed the backport pending label Jan 28, 2019

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

original-brownbear mentioned this pull request Feb 12, 2019

Forbid allocate_empty_primary if Stale Copy Exists? #38763

Closed
